I chose this topic because it concerns me greatly and saddens me. I hope this project throws some light on it. The World Health Organization (WHO) reported that every 40 seconds a person somewhere in the world dies by suicide. Despite this alarmingly high figure, WHO said only a handful of countries have policies aimed at suicide prevention. Source: https://www.who.int/
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(gapminder)
library(ggthemes)
library(ggpubr)
## Loading required package: magrittr
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggpubr':
##
## get_legend
## The following object is masked from 'package:ggthemes':
##
## theme_map
## The following object is masked from 'package:ggplot2':
##
## ggsave
library(grid)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(viridisLite)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(DT)
options(scipen=999)
m<-read.csv("master.csv")
str(m)
## 'data.frame': 27820 obs. of 12 variables:
## $ ï..country : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
## $ age : Factor w/ 6 levels "15-24 years",..: 1 3 1 6 2 6 3 2 5 4 ...
## $ suicides_no : int 21 16 14 1 9 1 6 4 1 0 ...
## $ population : int 312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
## $ suicides.100k.pop : num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ country.year : Factor w/ 2321 levels "Albania1987",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HDI.for.year : num NA NA NA NA NA NA NA NA NA NA ...
## $ gdp_for_year.... : Factor w/ 2321 levels "1,002,219,052,968",..: 727 727 727 727 727 727 727 727 727 727 ...
## $ gdp_per_capita....: int 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : Factor w/ 6 levels "Boomers","G.I. Generation",..: 3 6 3 2 1 2 6 1 2 3 ...
glimpse(m)
## Observations: 27,820
## Variables: 12
## $ ï..country <fct> Albania, Albania, Albania, Albania, Albania...
## $ year <int> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1...
## $ sex <fct> male, male, female, male, male, female, fem...
## $ age <fct> 15-24 years, 35-54 years, 15-24 years, 75+ ...
## $ suicides_no <int> 21, 16, 14, 1, 9, 1, 6, 4, 1, 0, 0, 0, 2, 1...
## $ population <int> 312900, 308000, 289700, 21800, 274300, 3560...
## $ suicides.100k.pop <dbl> 6.71, 5.19, 4.83, 4.59, 3.28, 2.81, 2.15, 1...
## $ country.year <fct> Albania1987, Albania1987, Albania1987, Alba...
## $ HDI.for.year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ gdp_for_year.... <fct> "2,156,624,900", "2,156,624,900", "2,156,62...
## $ gdp_per_capita.... <int> 796, 796, 796, 796, 796, 796, 796, 796, 796...
## $ generation <fct> Generation X, Silent, Generation X, G.I. Ge...
sum(complete.cases(m))
## [1] 8364
sum(is.na(m))
## [1] 19456
summary(m)
## ï..country year sex age
## Austria : 382 Min. :1985 female:13910 15-24 years:4642
## Iceland : 382 1st Qu.:1995 male :13910 25-34 years:4642
## Mauritius : 382 Median :2002 35-54 years:4642
## Netherlands: 382 Mean :2001 5-14 years :4610
## Argentina : 372 3rd Qu.:2008 55-74 years:4642
## Belgium : 372 Max. :2016 75+ years :4642
## (Other) :25548
## suicides_no population suicides.100k.pop
## Min. : 0.0 Min. : 278 Min. : 0.00
## 1st Qu.: 3.0 1st Qu.: 97498 1st Qu.: 0.92
## Median : 25.0 Median : 430150 Median : 5.99
## Mean : 242.6 Mean : 1844794 Mean : 12.82
## 3rd Qu.: 131.0 3rd Qu.: 1486143 3rd Qu.: 16.62
## Max. :22338.0 Max. :43805214 Max. :224.97
##
## country.year HDI.for.year gdp_for_year....
## Albania1987: 12 Min. :0.483 1,002,219,052,968: 12
## Albania1988: 12 1st Qu.:0.713 1,011,797,457,139: 12
## Albania1989: 12 Median :0.779 1,016,418,229 : 12
## Albania1992: 12 Mean :0.777 1,018,847,043,277: 12
## Albania1993: 12 3rd Qu.:0.855 1,022,191,296 : 12
## Albania1994: 12 Max. :0.944 1,023,196,003,075: 12
## (Other) :27748 NA's :19456 (Other) :27748
## gdp_per_capita.... generation
## Min. : 251 Boomers :4990
## 1st Qu.: 3447 G.I. Generation:2744
## Median : 9372 Generation X :6408
## Mean : 16866 Generation Z :1470
## 3rd Qu.: 24874 Millenials :5844
## Max. :126352 Silent :6364
##
names(m)
## [1] "ï..country" "year" "sex"
## [4] "age" "suicides_no" "population"
## [7] "suicides.100k.pop" "country.year" "HDI.for.year"
## [10] "gdp_for_year...." "gdp_per_capita...." "generation"
m<-rename(m, "country"="ï..country","gdp.c"="gdp_per_capita....","gdp.y"="gdp_for_year....")
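The `ï..country` name is what a UTF-8 byte order mark (BOM) at the start of a file looks like once `read.csv` has parsed it as part of the first header. A minimal sketch of an alternative, assuming `master.csv` is UTF-8 with a BOM (the temporary file below is a stand-in built only for illustration): reading with `fileEncoding = "UTF-8-BOM"` drops the mark, so the first column would arrive as `country` and only the two GDP columns would still need renaming.

```r
# Build a tiny stand-in file that starts with a UTF-8 byte order mark.
# (Hypothetical data; the real file is master.csv.)
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("country,year\nAlbania,1987\n")), tmp)

# "UTF-8-BOM" tells R to strip the mark before parsing the header.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```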
I started to clean the data, then I thought, “I should probably check how many countries are in this set.”
select(m,country) %>% unique %>% nrow
## [1] 101
unique(m$country)
## [1] Albania Antigua and Barbuda
## [3] Argentina Armenia
## [5] Aruba Australia
## [7] Austria Azerbaijan
## [9] Bahamas Bahrain
## [11] Barbados Belarus
## [13] Belgium Belize
## [15] Bosnia and Herzegovina Brazil
## [17] Bulgaria Cabo Verde
## [19] Canada Chile
## [21] Colombia Costa Rica
## [23] Croatia Cuba
## [25] Cyprus Czech Republic
## [27] Denmark Dominica
## [29] Ecuador El Salvador
## [31] Estonia Fiji
## [33] Finland France
## [35] Georgia Germany
## [37] Greece Grenada
## [39] Guatemala Guyana
## [41] Hungary Iceland
## [43] Ireland Israel
## [45] Italy Jamaica
## [47] Japan Kazakhstan
## [49] Kiribati Kuwait
## [51] Kyrgyzstan Latvia
## [53] Lithuania Luxembourg
## [55] Macau Maldives
## [57] Malta Mauritius
## [59] Mexico Mongolia
## [61] Montenegro Netherlands
## [63] New Zealand Nicaragua
## [65] Norway Oman
## [67] Panama Paraguay
## [69] Philippines Poland
## [71] Portugal Puerto Rico
## [73] Qatar Republic of Korea
## [75] Romania Russian Federation
## [77] Saint Kitts and Nevis Saint Lucia
## [79] Saint Vincent and Grenadines San Marino
## [81] Serbia Seychelles
## [83] Singapore Slovakia
## [85] Slovenia South Africa
## [87] Spain Sri Lanka
## [89] Suriname Sweden
## [91] Switzerland Thailand
## [93] Trinidad and Tobago Turkey
## [95] Turkmenistan Ukraine
## [97] United Arab Emirates United Kingdom
## [99] United States Uruguay
## [101] Uzbekistan
## 101 Levels: Albania Antigua and Barbuda Argentina Armenia ... Uzbekistan
This data set is missing a lot: most importantly China and India, and at least 90 other countries.
m1<- m%>%
group_by(country) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
m1
## # A tibble: 101 x 2
## country total_suicides
## <fct> <int>
## 1 Russian Federation 1209742
## 2 United States 1034013
## 3 Japan 806902
## 4 France 329127
## 5 Ukraine 319950
## 6 Germany 291262
## 7 Republic of Korea 261730
## 8 Brazil 226613
## 9 Poland 139098
## 10 United Kingdom 136805
## # ... with 91 more rows
To check my data set against another one, I chose the nations of the world data set, which I got from Kaggle.
nations <- read_csv("nations.csv")
## Parsed with column specification:
## cols(
## iso2c = col_character(),
## iso3c = col_character(),
## country = col_character(),
## year = col_double(),
## gdp_percap = col_double(),
## population = col_double(),
## birth_rate = col_double(),
## neonat_mortal_rate = col_double(),
## region = col_character(),
## income = col_character()
## )
select(nations,country) %>% unique %>% nrow
## [1] 211
unique(nations$country)
## [1] "Andorra" "United Arab Emirates"
## [3] "Afghanistan" "Antigua and Barbuda"
## [5] "Albania" "Armenia"
## [7] "Angola" "Argentina"
## [9] "American Samoa" "Austria"
## [11] "Australia" "Aruba"
## [13] "Azerbaijan" "Bosnia and Herzegovina"
## [15] "Barbados" "Bangladesh"
## [17] "Belgium" "Burkina Faso"
## [19] "Bulgaria" "Bahrain"
## [21] "Burundi" "Benin"
## [23] "Bermuda" "Brunei Darussalam"
## [25] "Bolivia" "Brazil"
## [27] "Bahamas, The" "Bhutan"
## [29] "Botswana" "Belarus"
## [31] "Belize" "Canada"
## [33] "Congo, Dem. Rep." "Central African Republic"
## [35] "Congo, Rep." "Switzerland"
## [37] "Cote d'Ivoire" "Chile"
## [39] "Cameroon" "China"
## [41] "Colombia" "Costa Rica"
## [43] "Cuba" "Curacao"
## [45] "Cyprus" "Czech Republic"
## [47] "Germany" "Djibouti"
## [49] "Denmark" "Dominica"
## [51] "Dominican Republic" "Algeria"
## [53] "Ecuador" "Estonia"
## [55] "Egypt, Arab Rep." "Eritrea"
## [57] "Spain" "Ethiopia"
## [59] "Finland" "Fiji"
## [61] "Micronesia, Fed. Sts." "France"
## [63] "Gabon" "United Kingdom"
## [65] "Grenada" "Georgia"
## [67] "Ghana" "Gibraltar"
## [69] "Greenland" "Gambia, The"
## [71] "Guinea" "Equatorial Guinea"
## [73] "Greece" "Guatemala"
## [75] "Guam" "Guinea-Bissau"
## [77] "Guyana" "Hong Kong SAR, China"
## [79] "Honduras" "Croatia"
## [81] "Haiti" "Hungary"
## [83] "Indonesia" "Ireland"
## [85] "Israel" "Isle of Man"
## [87] "India" "Iraq"
## [89] "Iran, Islamic Rep." "Iceland"
## [91] "Italy" "Channel Islands"
## [93] "Jamaica" "Jordan"
## [95] "Japan" "Kenya"
## [97] "Kyrgyz Republic" "Cambodia"
## [99] "Kiribati" "Comoros"
## [101] "St. Kitts and Nevis" "Korea, Rep."
## [103] "Kuwait" "Cayman Islands"
## [105] "Kazakhstan" "Lao PDR"
## [107] "Lebanon" "St. Lucia"
## [109] "Liechtenstein" "Sri Lanka"
## [111] "Liberia" "Lesotho"
## [113] "Lithuania" "Luxembourg"
## [115] "Latvia" "Libya"
## [117] "Morocco" "Monaco"
## [119] "Moldova" "Montenegro"
## [121] "St. Martin (French part)" "Madagascar"
## [123] "Marshall Islands" "Macedonia, FYR"
## [125] "Mali" "Myanmar"
## [127] "Mongolia" "Macao SAR, China"
## [129] "Northern Mariana Islands" "Mauritania"
## [131] "Malta" "Mauritius"
## [133] "Maldives" "Malawi"
## [135] "Mexico" "Malaysia"
## [137] "Mozambique" "Namibia"
## [139] "New Caledonia" "Niger"
## [141] "Nigeria" "Nicaragua"
## [143] "Netherlands" "Norway"
## [145] "Nepal" "New Zealand"
## [147] "Oman" "Panama"
## [149] "Peru" "French Polynesia"
## [151] "Papua New Guinea" "Philippines"
## [153] "Pakistan" "Poland"
## [155] "Puerto Rico" "West Bank and Gaza"
## [157] "Portugal" "Palau"
## [159] "Paraguay" "Qatar"
## [161] "Romania" "Serbia"
## [163] "Russian Federation" "Rwanda"
## [165] "Saudi Arabia" "Solomon Islands"
## [167] "Seychelles" "Sudan"
## [169] "Sweden" "Singapore"
## [171] "Slovenia" "Slovak Republic"
## [173] "Sierra Leone" "San Marino"
## [175] "Senegal" "Somalia"
## [177] "Suriname" "South Sudan"
## [179] "Sao Tome and Principe" "El Salvador"
## [181] "Sint Maarten (Dutch part)" "Syrian Arab Republic"
## [183] "Swaziland" "Turks and Caicos Islands"
## [185] "Chad" "Togo"
## [187] "Thailand" "Tajikistan"
## [189] "Timor-Leste" "Turkmenistan"
## [191] "Tunisia" "Tonga"
## [193] "Turkey" "Trinidad and Tobago"
## [195] "Tuvalu" "Tanzania"
## [197] "Ukraine" "Uganda"
## [199] "United States" "Uruguay"
## [201] "Uzbekistan" "St. Vincent and the Grenadines"
## [203] "Venezuela, RB" "Virgin Islands (U.S.)"
## [205] "Vietnam" "Vanuatu"
## [207] "Samoa" "Yemen, Rep."
## [209] "South Africa" "Zambia"
## [211] "Zimbabwe"
Present problems
Still, this data set had the reverse problem of the previous one, with countries being repeated, countries that no longer existed, and territories such as Guam and Gibraltar being included in the data. So I finally went to the internet and just looked it up. There are 195 countries in the world, which break down as follows:
* 54 in Africa
* 48 in Asia
* 44 in Europe
* 33 in Latin America and the Caribbean
* 14 in Oceania
* 2 in Northern America
Source: https://www.worldometers.info/geography/how-many-countries-are-there-in-the-world/
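Comparing the two country lists by name needs some care, because the same country can be spelled differently in each file (the outputs above show `Bahamas` versus `Bahamas, The` and `Republic of Korea` versus `Korea, Rep.`). A minimal sketch with a handful of names copied from the two listings; the short vectors are stand-ins for `unique(m$country)` and `unique(nations$country)`:

```r
# A few names taken from the two listings above (not the full sets).
suicide_countries <- c("Albania", "Bahamas", "Republic of Korea", "Japan")
nations_countries <- c("Albania", "Bahamas, The", "Korea, Rep.", "Japan",
                       "China", "India")

# In the nations list but not in the suicide data, matching on exact spelling.
setdiff(nations_countries, suicide_countries)

# In the suicide data but not in the nations list.
setdiff(suicide_countries, nations_countries)
```

Here China and India are genuinely absent, while the Bahamas and Korea entries are spelling mismatches that would need a manual lookup table before the lists could be compared reliably.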
General problems with the data
* 7 countries had less than 3 years of data in total.
* 2016 had data for almost no countries, and the countries that were represented often had data missing.
* HDI was missing for about two thirds of the rows.
* The generation variable has problems (it is not ordinal).
* Very few African countries provide suicide data.
* Countries with large populations, such as China and India, are absent from the data.
* There is a general lack of countries: only 101 out of 195.
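The first item in the list can be checked by counting distinct years per country. A minimal sketch on made-up rows (the data frame `toy` and its values are hypothetical; on the real data the same steps would be applied to `m`):

```r
# Hypothetical rows standing in for m; only country and year matter here.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C"),
  year    = c(1990, 1991, 1992, 2000, 2000, 2016)
)

# Number of distinct years reported by each country.
years_per_country <- tapply(toy$year, toy$country,
                            function(y) length(unique(y)))

# Countries with fewer than three years of data.
sparse <- names(years_per_country[years_per_country < 3])
sparse
```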
So quite naturally I took the high road and imported another data set.
who<- read_csv("who-suicide-statistics/who_suicide_statistics.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## sex = col_character(),
## age = col_character(),
## suicides_no = col_double(),
## population = col_double()
## )
str(who)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 43776 obs. of 6 variables:
## $ country : chr "Albania" "Albania" "Albania" "Albania" ...
## $ year : num 1985 1985 1985 1985 1985 ...
## $ sex : chr "female" "female" "female" "female" ...
## $ age : chr "15-24 years" "25-34 years" "35-54 years" "5-14 years" ...
## $ suicides_no: num NA NA NA NA NA NA NA NA NA NA ...
## $ population : num 277900 246800 267500 298300 138700 ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. year = col_double(),
## .. sex = col_character(),
## .. age = col_character(),
## .. suicides_no = col_double(),
## .. population = col_double()
## .. )
glimpse(who)
## Observations: 43,776
## Variables: 6
## $ country <chr> "Albania", "Albania", "Albania", "Albania", "Alban...
## $ year <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 19...
## $ sex <chr> "female", "female", "female", "female", "female", ...
## $ age <chr> "15-24 years", "25-34 years", "35-54 years", "5-14...
## $ suicides_no <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ population <dbl> 277900, 246800, 267500, 298300, 138700, 34200, 301...
sum(complete.cases(who))
## [1] 36060
sum(is.na(who))
## [1] 7716
Already better: only about a sixth of the rows contain an NA.
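The share of incomplete rows can be worked out directly from the two counts printed above. A short check using only numbers already reported (43,776 rows, 36,060 of them complete):

```r
total_rows    <- 43776   # rows in who, from str(who)
complete_rows <- 36060   # from sum(complete.cases(who))

rows_with_na <- total_rows - complete_rows
share_na     <- rows_with_na / total_rows

rows_with_na               # 7716
round(100 * share_na, 1)   # 17.6
```

On the data itself, `mean(!complete.cases(who))` would give the same proportion.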
summary(who)
## country year sex age
## Length:43776 Min. :1979 Length:43776 Length:43776
## Class :character 1st Qu.:1990 Class :character Class :character
## Mode :character Median :1999 Mode :character Mode :character
## Mean :1999
## 3rd Qu.:2007
## Max. :2016
##
## suicides_no population
## Min. : 0.0 Min. : 259
## 1st Qu.: 1.0 1st Qu.: 85113
## Median : 14.0 Median : 380655
## Mean : 193.3 Mean : 1664091
## 3rd Qu.: 91.0 3rd Qu.: 1305698
## Max. :22338.0 Max. :43805214
## NA's :2256 NA's :5460
who1<- who%>%
group_by(country) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
head(who1)
## # A tibble: 6 x 2
## country total_suicides
## <chr> <dbl>
## 1 Russian Federation 1500992
## 2 United States of America 1201401
## 3 Japan 937614
## 4 France 395500
## 5 Ukraine 365170
## 6 Germany 291262
sapply(who, function(x) sum(is.na(x)))
## country year sex age suicides_no population
## 0 0 0 0 2256 5460
So we can see that the NAs occur only in the suicides_no and population variables.
na_df <- who[is.na(who$suicides_no) | is.na(who$population),]
nrow(na_df)
## [1] 7716
Handling NA values
In total, 7716 rows contain missing values, which is 17.6% of the whole dataset. We will sort the NAs to find out whether the missing values for both variables (population, suicide number) are seen mostly in a specific year or country. This is to see whether the missingness is concentrated, and therefore a source of bias, or random, and thereby has less of an overall effect.
na_population <- who[is.na(who$population),]
na_population$country <- factor(na_population$country,
levels = unique(na_population$country))
na_population_by_country <- as.data.frame(table(na_population$country))
colnames(na_population_by_country) <- c('country', 'frequence')
# order data frame by decreasing frequency
na_population_by_country <- na_population_by_country[order(-na_population_by_country$frequence),]
# order factor so that we can plot in decreasing frequency
na_population_by_country$country <- factor(na_population_by_country$country,
levels = unique(na_population_by_country$country[
order(-na_population_by_country$frequence,
na_population_by_country$country)]))
# plotting na values of population by country in decreasing order
ggplot(data=na_population_by_country, aes(x=country, y=frequence, fill = country)) +
geom_bar(stat="identity", width = 0.3) +
theme(axis.text.x=element_blank()) + ggtitle('NA Population Values per Country')
na_population$year <- factor(na_population$year, levels = unique(na_population$year))
na_population_by_year <- as.data.frame(table(na_population$year))
colnames(na_population_by_year) <- c('year', 'frequence')
na_population_by_year$year <- factor(na_population_by_year$year,
levels = 1978:2016, ordered = TRUE)
ggplot(data=na_population_by_year, aes(x=year, y=frequence, fill = year)) +
geom_bar(stat="identity", width = 0.3) +
theme(axis.text.x=element_blank()) + ggtitle('NA Population values per Year')
na_suicides <- who[is.na(who$suicides_no),]
# we remove levels from the country and year factor that are missing
na_suicides$country <- factor(na_suicides$country, levels = unique(na_suicides$country))
na_suicides_by_country <- as.data.frame(table(na_suicides$country))
colnames(na_suicides_by_country) <- c('country', 'frequence')
# order levels of countries depending on the missing rows
# order data frame by decreasing frequence
na_suicides_by_country <- na_suicides_by_country[order(-na_suicides_by_country$frequence),]
# order factor so that we can plot in decreasing frequency
na_suicides_by_country$country <- factor(na_suicides_by_country$country,
levels = unique(na_suicides_by_country$country[
order(-na_suicides_by_country$frequence,
na_suicides_by_country$country)]))
ggplot(data=na_suicides_by_country, aes(x=country, y=frequence, fill = country)) +
geom_bar(stat="identity", width = 0.3) +
ggtitle('NA suicide_no values per Country') +
theme(axis.text.x=element_blank())
na_suicides$year <- factor(na_suicides$year, levels = unique(na_suicides$year))
na_suicides_by_year <- as.data.frame(table(na_suicides$year))
colnames(na_suicides_by_year) <- c('year', 'frequence')
na_suicides_by_year$year <- factor(na_suicides_by_year$year,
levels = 1978:2016, ordered = TRUE)
ggplot(data=na_suicides_by_year, aes(x=year, y=frequence, fill = year)) +
geom_bar(stat="identity", width = 0.3) +
ggtitle('NA suicide_no Values per Year') +
theme(axis.text.x=element_blank())
na_population_by_age <- as.data.frame(table(na_population$age))
na_population_by_age
## Var1 Freq
## 1 15-24 years 910
## 2 25-34 years 910
## 3 35-54 years 910
## 4 5-14 years 910
## 5 55-74 years 910
## 6 75+ years 910
na_suicides_by_age <- as.data.frame(table(na_suicides$age))
na_suicides_by_age
## Var1 Freq
## 1 15-24 years 376
## 2 25-34 years 376
## 3 35-54 years 376
## 4 5-14 years 376
## 5 55-74 years 376
## 6 75+ years 376
Results
Plotting the NA values by year and by country leads us to the conclusion that there is no particular connection between the missing data and the variables: the gaps look fairly random. However, there are countries whose groups always have at least one NA value (e.g. Peru does not have any registered population).
On the other hand, the fact that rows with NA population are spread equally across the age groups suggests that when data are missing, they are missing for a whole country-year (e.g. there is no case where the suicide number for Denmark is missing only for the age group 15-24; instead, all age groups of this country for that year have an NA value for the suicides variable).
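The idea that missing values come in whole country-year blocks can be tested by counting NA rows within each country-year group: if it holds, every group that has any NA has all 12 of its rows (6 age groups times 2 sexes) missing. A minimal sketch on made-up data; `toy` and its values are hypothetical, and on the real data the same steps would be applied to `who`:

```r
# Hypothetical data: two country-years, 2 sexes x 6 age groups = 12 rows each.
ages <- c("5-14 years", "15-24 years", "25-34 years",
          "35-54 years", "55-74 years", "75+ years")
toy <- expand.grid(age = ages, sex = c("female", "male"),
                   country_year = c("X 1990", "Y 1990"),
                   stringsAsFactors = FALSE)
toy$suicides_no <- 1
toy$suicides_no[toy$country_year == "Y 1990"] <- NA  # one whole block missing

# NA rows and total rows per country-year.
na_rows  <- tapply(is.na(toy$suicides_no), toy$country_year, sum)
all_rows <- tapply(toy$suicides_no, toy$country_year, length)

# Groups that are only partly missing; empty if NAs come in whole blocks.
partial <- names(na_rows)[na_rows > 0 & na_rows < all_rows]
partial
```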
who$suicides_no <- as.numeric(who$suicides_no)
who$population <- as.numeric(who$population)
There are many countries missing per year. For nearly half of the time period we examine, we have data for at most 100 countries out of the 141 mentioned in total (so unfortunately not much better than our original data set). To handle these issues, there are different approaches we could consider:
1. A possible idea would be to complete all the missing rows with NA values and then try to impute/predict all of them. The problem is that on some occasions so much data is missing that predicting all of it would probably be a bad idea, as it would add a lot of bias to our predictions.
2. Another idea would be to fill in the data by searching the WHO database https://www.who.int/mental_health/prevention/suicide/countrydata/en/ and other internet resources.
3. The third would be to impute just the existing NA values using the MICE package.
4. The fourth and final one would be to keep just the complete cases of the existing dataset.
I think the fourth option is the most diplomatic, so that’s what I’m going to choose.
Exploration and visualization
By Country
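Keeping only complete cases is the simplest of these approaches to express. A minimal sketch, assuming a data frame shaped like `who` (the rows below are made up for illustration): `complete.cases()` flags rows with no NA in any column, and indexing by it drops the rest.

```r
# Hypothetical rows shaped like who.
toy <- data.frame(
  country     = c("A", "A", "B", "B"),
  suicides_no = c(10, NA, 5, 7),
  population  = c(1000, 2000, NA, 4000)
)

# Keep only rows where every column is observed.
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)   # 2
```

Note that the summaries in the rest of this report call `sum(..., na.rm = TRUE)` on the full table instead, which skips missing suicide counts but keeps rows whose population is missing.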
whoDF_s1 <- who %>%
group_by(country) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
ggplot(whoDF_s1,aes(x=reorder(country,-total_suicides),y=total_suicides,fill=-total_suicides))+
geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(x="Country",y="Count",title="Suicide Totals by Country")+
theme(plot.title = element_text(size=15,face="bold"))
whoDF_s3 <- who %>%
group_by(year,country) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
ggplot(whoDF_s3,aes(x=year,y=total_suicides,fill=-total_suicides))+
geom_col() +
labs(x="Year",y="Count",title="Suicides Worldwide")+
theme(plot.title = element_text(size=15,face="bold"))
Insights
* Data are clearly missing at the end of the graph.
* Data are also missing in the 1980s, which is because of NAs from Russia in that period.
By Age
whoDF_sa <- who %>%
group_by(age) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
ggplot(whoDF_sa,aes(x=reorder(age,-total_suicides),y=total_suicides,fill=-total_suicides))+
geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(x="Age",y="Count",title="Suicides by Age Group")+
theme(plot.title = element_text(size=15,face="bold"))
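The bars above are raw counts, and the age bands cover different numbers of years (10 years for 15-24, 20 years for 35-54), so a per-capita view is a useful cross-check. A minimal sketch, assuming rows with both `suicides_no` and `population` present; the numbers in `toy` are invented purely to show the calculation:

```r
# Hypothetical counts and populations for two age groups.
toy <- data.frame(
  age         = c("15-24 years", "15-24 years", "35-54 years", "35-54 years"),
  suicides_no = c(10, 20, 60, 40),
  population  = c(100000, 200000, 500000, 500000)
)

# Sum suicides and population within each age group, then scale to per 100,000.
by_age <- aggregate(cbind(suicides_no, population) ~ age, data = toy, FUN = sum)
by_age$rate_per_100k <- 1e5 * by_age$suicides_no / by_age$population
by_age
```

In this invented example the older group has more than three times the deaths but the same rate, which is why counts alone cannot settle which age group is most at risk.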
We can clearly observe that the highest number of suicides occurs in middle age. This surprised me because I assumed it would be highest among the elderly or teenagers.
By Gender
whoDF_sb <- who %>%
group_by(sex) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
ggplot(whoDF_sb,aes(x=reorder(sex,-total_suicides),y=total_suicides,fill=-total_suicides))+
geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(x="Gender",y="Count",title="Suicide by Gender")+
theme(plot.title = element_text(size=15,face="bold"))
t.test(suicides_no ~ sex , data = who , alternative = "less")
##
## Welch Two Sample t-test
##
## data: suicides_no by sex
## t = -26.091, df = 24030, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -190.5464
## sample estimates:
## mean in group female mean in group male
## 91.6316 294.9992
The result is significant: the p-value is less than 2.2e-16. Therefore, we can firmly say that in this time period males died by suicide in significantly greater numbers than females (a mean of 295.0 per group versus 91.6).
This may be due to what Simon Hatcher, vice-chair of research for the Department of Psychiatry at the University of Ottawa, says: “Women are actually more likely to try to kill themselves - three to four times more likely. But men are more likely to die from it. That’s a pattern that holds true across Canada, and in most of the rest of the world as well. That’s mainly due to two things: One is that men use more lethal means [to attempt suicide], and the second is that they don’t seek care as much.”
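With `alternative = "less"` and a two-level grouping variable, `t.test` tests whether the mean of the first level (alphabetically `female`) is less than that of the second (`male`). A minimal sketch on made-up counts, only to show how the direction is read:

```r
# Hypothetical per-group counts; not taken from the WHO data.
toy <- data.frame(
  sex         = rep(c("female", "male"), each = 5),
  suicides_no = c(2, 3, 4, 3, 2, 9, 11, 10, 12, 8)
)

# One-sided Welch test: H1 is mean(female) - mean(male) < 0.
res <- t.test(suicides_no ~ sex, data = toy, alternative = "less")
res$estimate   # group means: female 2.8, male 10
res$p.value    # small, so the female mean is significantly lower here
```

The test compares mean counts per row, so it says nothing about population size; a comparison of rates would need the population column as well.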
Insights
Comparing the nations with the most suicides by the numbers.
whoDF_s6 <- who %>%
filter(country == "Russian Federation", year %in% c(1980, 1997, 2015)) %>%
group_by(sex,age,year) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE))
ggplot(whoDF_s6,aes(x=factor(age,levels = c("5-14 years","15-24 years","25-34 years","35-54 years","55-74 years","75+ years")),y=total_suicides,fill=sex))+
geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
facet_wrap(~ year) +
labs(x="Age",y="Count",title="Suicides in Russia")+
theme(plot.title = element_text(size=15,face="bold"))
whoDF_s7 <- who %>%
filter(country == "United States of America", year %in% c(1980, 1997, 2015)) %>%
group_by(sex,age,year) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE))
ggplot(whoDF_s7,aes(x=factor(age,levels = c("5-14 years","15-24 years","25-34 years","35-54 years","55-74 years","75+ years")),y=total_suicides,fill=sex))+
geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
facet_wrap(~ year) +
labs(x="Age",y="Count",title="Suicides in USA")+
theme(plot.title = element_text(size=15,face="bold"))
In short, in Russia suicides have gone down, while in the USA they have gone up.
TC <- who %>%
select(country, year, sex, age, suicides_no, population) %>%
filter(country %in% c("Russian Federation","United States of America","Japan","France","Ukraine","Germany","Republic of Korea","Brazil","Poland","United Kingdom" ))
I wanted to see which countries had the most suicides in this data set by sheer numbers.
whoDF_s8 <- TC%>%
group_by(country) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
ggplot(whoDF_s8,aes(x=reorder(country,-total_suicides),y=total_suicides,fill=-total_suicides))+
geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(x="Country",y="Count",title="Top Ten Countries")+
theme(plot.title = element_text(size=18,face="bold"))
It’s kind of a hodgepodge. I guess Russia, Japan and the USA make sense, but the rest of them (besides the European aspect) have very little in common. This breaks apart the idea that it’s because of a single culture, economy or code of ethics. Everyone can point a finger at Japan and say “it’s very pressured there”, but that is a micro, not a macro, observation.
df_top10 <- who %>%
filter(country %in% c("Russian Federation", "United States of America", "Japan", "France", "Ukraine", "Germany", "Republic of Korea", "Brazil", "United Kingdom", "Poland"))
df7<- who%>%
filter(country %in% c("Japan", "France", "Ukraine", "Germany", "Republic of Korea", "Brazil", "United Kingdom"))
whoDF_10 <- df_top10 %>%
group_by(year,country) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
ggplot(whoDF_10,mapping = aes(x=year, y=total_suicides, colour = country)) +geom_line(aes(linetype = country))
This is to show where the NAs kick in.
hop <- df7 %>%
group_by(year,country) %>%
summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
arrange(desc(total_suicides))
ggplot(data = hop, mapping = aes(x =country , y = total_suicides)) +
geom_boxplot() +
coord_flip()
This is me taking out the top three and trying to find a pattern.
df_country_group <- who %>%
group_by(country) %>%
summarise(sumsui= sum( as.numeric(suicides_no)),popsum = sum(as.numeric(population)))
df_country_group <- na.omit(df_country_group)
df_country_group_ratio<- df_country_group %>% mutate(ratioSP =sumsui/popsum)
df_country_group_ratio <- df_country_group_ratio %>% arrange(-ratioSP)
df_country_group_ratio_top <- head(df_country_group_ratio,10)
For this I first wanted to show something statistical. Instead of just listing the countries with the biggest numbers, I wanted a ratio showing the countries that proportionally have the largest amounts of suicide.
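The ratio is the total number of suicides divided by the total population summed over the same rows, so it is easiest to read when scaled to deaths per 100,000 person-years. A minimal sketch with invented totals (the country labels and figures are hypothetical):

```r
# Hypothetical totals summed over all years, sexes and age groups.
totals <- data.frame(
  country = c("P", "Q", "R"),
  sumsui  = c(500, 12000, 90),
  popsum  = c(2e6, 1e8, 3e5)
)

# Same ratio as ratioSP, plus a per-100,000 version for readability.
totals$ratioSP       <- totals$sumsui / totals$popsum
totals$rate_per_100k <- 1e5 * totals$ratioSP

# Largest ratio first.
ranked <- totals[order(-totals$ratioSP), c("country", "rate_per_100k")]
ranked
```

A small country (R) can top this ranking despite having the fewest deaths, which is the point of switching from counts to a ratio.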
df_country_group_ratio_top1 <- df_country_group_ratio_top %>%
arrange(ratioSP)
ggplot(df_country_group_ratio_top1,aes(x = ratioSP, y = country )) +
scale_y_discrete(limits= df_country_group_ratio_top1$country) +
geom_segment( aes(xend=0,yend=country),size = 1,color='red2')+
geom_point(fill="red2",color="green",size=4,shape=21,stroke=2) +
ggtitle("Countries with highest suicide to population ratio")+
labs(x="Ratio : Suicide/Population", y="Country")
Made it Interactive.
df_country_group_ratio_top1 <- df_country_group_ratio_top %>%
arrange(ratioSP)
ggplotly(
ggplot(df_country_group_ratio_top1,aes(x = ratioSP, y = country )) +
scale_y_discrete(limits= df_country_group_ratio_top1$country) +
geom_segment( aes(xend=0,yend=country),size = 1,color='red2')+
geom_point(fill="red2",color="red4",size=4,shape=21,stroke=2) +
#ggtitle("Countries with highest suicide to population ratio")+
labs(x="Ratio : Suicide/Population", y="Country")
)
Based on the 2016 National Survey on Drug Use and Health, it is estimated that 0.5 percent of adults aged 18 or older made at least one suicide attempt. This translates to approximately 1.3 million adults. Adult females reported a suicide attempt 1.2 times as often as males. Further breakdowns by gender and race are not available. This data set needed more variables. I was surprised that middle-aged people were the most prone to suicide. In conclusion, suicide is an epidemic with a large number of causes but a dearth of solutions. There needs to be more research conducted on the subject and fewer preconceived notions.
I chose this data set primarily because suicide is one of those areas of suffering that, while being researched, we have yet to get a definite scientific handle on. Part of the reason is that there is no clear, definite answer to what exactly causes suicide. Directly related to this is the mix of numerous reasons, and the varying degree of those reasons, that bring someone to the point of suicide. Human beings are complex creatures who act and are acted upon for a host of dependent and independent reasons, all the while interfacing with a world which can frankly be cruel at times. The biggest hindrance, ironically, is the very thing that gives us our understanding. The human mind is as multivariate and complex as any weather pattern, as powerful as any supercomputer, and as scattered and seemingly random as the billions of causes and reasons behind man’s daily interaction with his world. All of this leads to a situation in which people can point fingers at stereotypical and false outliers and say “that’s what causes suicide”.
What I wanted to show, at the very least, with this data set was the truly global nature of suicide and how it affects all races, groups and countries regardless of socioeconomic standing. This is what the New York Times reported in 2016: “When it comes to suicide and suicide attempts there are rate differences depending on demographic characteristics such as age, gender, ethnicity and race. Nonetheless, suicide occurs in all demographic groups.” Part of the problem is classification, as the WHO reports: “In 2015, 505,507 people visited a hospital for injuries due to self-harm. This number suggests that for every reported suicide death, approximately 11.4 people visit a hospital for self-harm related injuries. However, because of the way these data are collected, we are not able to distinguish intentional suicide attempts from non-intentional self-harm behaviors.”
In my project there were many obstacles to overcome, such as figuring out how the NAs were distorting the plots. Another difficulty was just figuring out how to group everything. I would say my biggest challenge was mutating the data for the last visualization to create a new variable giving the real ratio of suicides to population in a country. My smallest challenge was accidentally deleting my project this morning and rushing to get it done by the deadline (“still have 20 minutes”). When cleaning the data, I did consider trying to retrieve the missing information, but it proved too cumbersome. My interactive map met the same fate; I really wish I could have executed that properly. It’s sad, partially because this data set was a bit bare bones and it sort of confirmed my preconditioned biases. The truth is there are some clear indicators of suicide; they are just often obscured by other unseen variables. I was surprised by how global suicide is: like other human things, it tragically knows no boundary, country or culture. The age finding also really shocked me; who would have thought middle-aged people would be the most prone to suicide? In conclusion, I hope that there will be more research conducted about suicide and that one day we will be rid of this very human curse.
Bibliography
New York Times.
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/